Analysis of community structure in networks of correlated data

We present a reformulation of modularity that allows the analysis of the community structure in networks of correlated data. The new modularity preserves the probabilistic semantics of the original definition even when the network is directed, weighted, signed, and has self-loops. This is the most general condition one can find in the study of any network, in particular those defined from correlated data. We apply our results to a real network of correlated data between stores in the city of Lyon (France).

I. INTRODUCTION

Complex networks are graphs representing the intricate connections between elements in many natural and artificial systems, whose description in terms of statistical properties has been largely developed in the search for a universal classification. However, when networks are analyzed locally, some characteristics emerge that are partially hidden in the global statistical description. Perhaps the most relevant is the discovery, in many of them, of community structure, meaning the existence of densely (or strongly) connected groups of nodes, with sparse (or weak) connections between these groups.

The study of community structure helps to elucidate the organization of networks and, eventually, could be related to the functionality of groups of nodes. The most successful solutions to the community detection problem, in terms of accuracy and computational cost, are those based on the optimization of a quality function called modularity, proposed by Newman and Girvan, which allows the comparison of different partitions of the network. The extensions of modularity to weighted [1] and directed networks have been the first steps towards the analysis of community structure in general networks.

Very often networks are defined from correlation data between elements. The common analysis of correlation matrices uses classical or advanced statistical techniques. Nevertheless, an alternative analysis in terms of networks is possible. The network approach usually consists in filtering the correlation matrix, by eliminating poorly correlated pairs according to a threshold and keeping only the unsigned value of the correlation, producing a network of positive links and no self-loops (self-correlations). Recently, some authors pointed out the possibility of analyzing these networks via spectral decomposition [12]. We also envisage the possibility of analyzing them in terms of Newman's modularity to reveal the community structure (clusters) of the correlated data. However, any of these approaches can be misleading for two reasons: first, the sign of the correlation is important to avoid mixing correlated and anti-correlated data; second, the existence of self-loops is critical for the determination of the community structure [9]. Here we propose a method to extract the community structure in networks of correlated data that accounts for the existence of signed correlations and self-correlations, preserving the original information. To this end, we extend modularity to the most general case of directed, weighted and signed links. We show the performance of our method on a real network of correlations between commercial activities, previously analyzed in [14] using a Potts model.



II. GENERALIZATION OF MODULARITY 

Given an undirected network partitioned into communities, the modularity of a given partition is, up to a multiplicative constant, the probability of having edges falling within groups in the network minus the expected probability in an equivalent (null case) network with the same number of nodes, and edges placed at random preserving the nodes' strength, where the strength of a node stands for the sum of the weights of its connections. In mathematical form, modularity is expressed in terms of the weighted adjacency matrix $w_{ij}$, which represents the value of the weight of the link between $i$ and $j$ (0 if no link exists), as [15]

$$ Q = \frac{1}{2w} \sum_i \sum_j \left( w_{ij} - \frac{w_i w_j}{2w} \right) \delta(C_i, C_j), \qquad (1) $$

where $C_i$ is the community to which node $i$ is assigned, the Kronecker delta function $\delta(C_i, C_j)$ takes the value 1 if nodes $i$ and $j$ are in the same community and 0 otherwise, the strengths are $w_i = \sum_j w_{ij}$, and the total strength is $2w = \sum_i w_i$.
The larger the modularity, the larger the deviation from the null case and the better the partitioning. Note that the optimization of modularity cannot be performed by exhaustive search, since the number of different partitions is equal to the Bell or exponential numbers, which grow at least exponentially in the number of nodes $N$. Indeed, optimization of modularity is an NP-hard (Non-deterministic Polynomial-time hard) problem [13]. Several authors have attacked the problem by proposing different optimization heuristics.

FIG. 1: Network with well-defined community structure and its correlation matrix.
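To give a feeling for why exhaustive search is hopeless, the growth of the Bell numbers can be tabulated directly. The following is a minimal illustrative sketch, not part of the original analysis; the function name `bell_numbers` is an assumption, and the Bell triangle recurrence is used as one standard way to generate the sequence.

```python
def bell_numbers(n_max):
    """Return [B_0, ..., B_n_max], the number of partitions of 0..n_max items.

    Uses the Bell triangle: each row starts with the last entry of the
    previous row, and every further entry adds the entry above-left.
    """
    bells = [1]
    row = [1]
    for _ in range(n_max):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


if __name__ == "__main__":
    for n, b in enumerate(bell_numbers(20)):
        print(n, b)
```

Already for ten nodes there are 115975 distinct partitions, which is why heuristics are used in practice.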

To demonstrate the flaws of modularity when trying to extract the community structure of correlated data, we show the following example. Suppose we have a network with a well-defined community structure, such as the one presented in Fig. 1. Let us pretend that each community is indeed a functional community, in such a way that the nodes in each group share a state that differs from that of the other group. To simplify the mathematics we will consider that the nodes in community A are in a state +1, and nodes in community B are in a state -1. We then define the correlation between these data as, for example, $R_{ij} = S_i S_j$, $S_i$ and $S_j$ being the corresponding states of nodes $i$ and $j$. The question is: can we infer communities A and B from the correlated data represented in matrix $R$? Applying modularity, the answer is negative. Let us sketch the proof. The matrix $R$ is blockwise composed of submatrices $R_{AA}$, $R_{AB}$, $R_{BA}$, and $R_{BB}$. The blocks $R_{AA}$ and $R_{BB}$ are all valued +1, and $R_{AB}$ and $R_{BA}$ are valued -1. Any matrix of this form results in zero modularity for all partitions, since

$$ R_{ij} = \frac{w_i w_j}{2w} $$

for all pairs (see Eq. (1)).
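The cancellation can be checked numerically. The following is a minimal sketch rather than the authors' code: the function name `modularity`, the group sizes (5 and 3 nodes) and the test partitions are illustrative assumptions, and Eq. (1) is taken to be the standard weighted modularity with self-loops included in the sums.

```python
def modularity(w, communities):
    """Standard weighted modularity for a symmetric weight matrix w
    (list of lists); communities[i] is the label of node i."""
    n = len(w)
    strength = [sum(row) for row in w]
    total = sum(strength)  # this is 2w
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                q += w[i][j] - strength[i] * strength[j] / total
    return q / total


# States +1 for community A (5 nodes) and -1 for community B (3 nodes).
states = [1] * 5 + [-1] * 3
R = [[si * sj for sj in states] for si in states]

partitions = {
    "true A/B split": [0] * 5 + [1] * 3,
    "arbitrary": [0, 1, 0, 1, 2, 2, 0, 1],
    "singletons": list(range(8)),
}

if __name__ == "__main__":
    for name, part in partitions.items():
        print(name, modularity(R, part))
```

Unequal group sizes are used on purpose: with equal sizes the total strength $2w$ vanishes and Eq. (1) is undefined.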



To reveal the community structure in the network presented in Fig. 1 from its correlation matrix, it is necessary to revise the formulation of modularity. Let us suppose that we have a weighted undirected complex network with weights $w_{ij}$ as above. The relative strength $p_i$ of a node

$$ p_i = \frac{w_i}{2w} \qquad (2) $$

may be interpreted as the probability that this node makes links to other ones, if the network were random. This is precisely the approach taken by Newman and Girvan to define the modularity null case term, which reads

$$ p_i p_j = \frac{w_i w_j}{(2w)^2}. $$

The introduction of negative weights destroys this probabilistic interpretation of $p_i$, since in this case the values of $p_i$ are not guaranteed to be between zero and one. The problem is the implicit hypothesis that there is only one unique probability to link nodes, which involves both positive and negative weights. To solve this problem, we have to introduce two different probabilities to form links, one for positive and the other for negative weights.

Let us formalize this approach. First, we separate the positive and negative weights:

$$ w_{ij} = w_{ij}^{+} - w_{ij}^{-}, $$

where

$$ w_{ij}^{+} = \max\{0, w_{ij}\}, $$
$$ w_{ij}^{-} = \max\{0, -w_{ij}\}. $$

The positive and negative strengths are given by

$$ w_i^{+} = \sum_j w_{ij}^{+}, $$
$$ w_i^{-} = \sum_j w_{ij}^{-}, $$

and the positive and negative total strengths by

$$ 2w^{+} = \sum_i w_i^{+} = \sum_i \sum_j w_{ij}^{+}, $$
$$ 2w^{-} = \sum_i w_i^{-} = \sum_i \sum_j w_{ij}^{-}. $$

Obviously,

$$ w_i = w_i^{+} - w_i^{-} $$

and

$$ 2w = 2w^{+} - 2w^{-}. $$

With these definitions at hand, the connection probabilities with positive and negative weights are respectively

$$ p_i^{+} = \frac{w_i^{+}}{2w^{+}}, \qquad p_i^{-} = \frac{w_i^{-}}{2w^{-}}. $$
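The effect of the sign separation on the connection probabilities can be illustrated on a toy case. This is only a sketch under stated assumptions: the 3-node weight matrix is hypothetical, and the helper names `link_probabilities` and `split_signed` are not from the original work.

```python
def link_probabilities(w):
    """Relative strengths p_i = w_i / (2w) of a weight matrix (list of lists)."""
    strengths = [sum(row) for row in w]
    total = sum(strengths)  # 2w
    return [s / total for s in strengths]


def split_signed(w):
    """Return (w_plus, w_minus), both non-negative, with w = w_plus - w_minus."""
    w_plus = [[max(0.0, x) for x in row] for row in w]
    w_minus = [[max(0.0, -x) for x in row] for row in w]
    return w_plus, w_minus


# Hypothetical symmetric signed network with three nodes.
w = [[0.0, 2.0, -1.0],
     [2.0, 0.0, -3.0],
     [-1.0, -3.0, 0.0]]

p = link_probabilities(w)               # mixes signs: not valid probabilities
w_plus, w_minus = split_signed(w)
p_plus = link_probabilities(w_plus)     # values in [0, 1], summing to 1
p_minus = link_probabilities(w_minus)   # values in [0, 1], summing to 1

if __name__ == "__main__":
    print(p, p_plus, p_minus)
```

In this example the unsplit relative strengths contain a negative entry, whereas the separate positive and negative quantities behave as proper probabilities.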
Now, there are two terms which contribute to modularity: the first one takes into account the deviation of actual positive weights against a null case random network given by probabilities $p_i^{+}$, and the other is its counterpart for negative weights. Thus, it is useful to define

$$ Q^{\pm} = \frac{1}{2w^{\pm}} \sum_i \sum_j \left( w_{ij}^{\pm} - \frac{w_i^{\pm} w_j^{\pm}}{2w^{\pm}} \right) \delta(C_i, C_j). $$

The total modularity must be a trade-off between the tendency of positive weights to form communities and that of negative weights to destroy them. If we want $Q^{+}$ and $Q^{-}$ to contribute to modularity proportionally to their respective positive and negative strengths, the final expression for modularity $Q$ is

$$ Q = \frac{2w^{+}}{2w^{+} + 2w^{-}} Q^{+} - \frac{2w^{-}}{2w^{+} + 2w^{-}} Q^{-}. $$

An alternative equivalent form for modularity $Q$ is

$$ Q = \frac{1}{2w^{+} + 2w^{-}} \sum_i \sum_j \left[ w_{ij} - \left( \frac{w_i^{+} w_j^{+}}{2w^{+}} - \frac{w_i^{-} w_j^{-}}{2w^{-}} \right) \right] \delta(C_i, C_j). \qquad (18) $$
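The equivalence of the two forms can be made explicit in a few lines. The following is a sketch of the algebra, assuming that $Q^{+}$ and $Q^{-}$ are the standard modularity, Eq. (1), evaluated on the positive and negative parts of the weights, and that they are weighted by $2w^{+}$ and $2w^{-}$ relative to the total $2w^{+} + 2w^{-}$.

```latex
% Assumed definition (symbols as in the text):
%   Q^{\pm} = \frac{1}{2w^{\pm}} \sum_{i}\sum_{j}
%             \left( w^{\pm}_{ij} - \frac{w^{\pm}_i w^{\pm}_j}{2w^{\pm}} \right) \delta(C_i, C_j)
\begin{align*}
Q &= \frac{2w^{+}}{2w^{+} + 2w^{-}}\, Q^{+} - \frac{2w^{-}}{2w^{+} + 2w^{-}}\, Q^{-} \\
  &= \frac{1}{2w^{+} + 2w^{-}} \sum_{i}\sum_{j}
     \left[ \left( w^{+}_{ij} - \frac{w^{+}_i w^{+}_j}{2w^{+}} \right)
          - \left( w^{-}_{ij} - \frac{w^{-}_i w^{-}_j}{2w^{-}} \right) \right] \delta(C_i, C_j) \\
  &= \frac{1}{2w^{+} + 2w^{-}} \sum_{i}\sum_{j}
     \left[ w_{ij} - \left( \frac{w^{+}_i w^{+}_j}{2w^{+}}
                          - \frac{w^{-}_i w^{-}_j}{2w^{-}} \right) \right] \delta(C_i, C_j),
\end{align*}
% where the last step uses w_{ij} = w^{+}_{ij} - w^{-}_{ij}.
```

The prefactors $2w^{\pm}$ cancel the normalizations $1/2w^{\pm}$ of $Q^{\pm}$, which is why a single common normalization $2w^{+} + 2w^{-}$ remains.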
The main properties of Eq. (18) are the following: without negative weights, the standard modularity is recovered; modularity is zero when all nodes are together in one community; and it is antisymmetric in the weights, i.e. $Q(C, \{w_{ij}\}) = -Q(C, \{-w_{ij}\})$.

The extension to directed networks [24] is simply obtained by the substitutions

$$ w_i^{\pm} w_j^{\pm} \rightarrow w_i^{\pm,\mathrm{out}} w_j^{\pm,\mathrm{in}}, $$

where $w_i^{\pm,\mathrm{out}} = \sum_j w_{ij}^{\pm}$ and $w_j^{\pm,\mathrm{in}} = \sum_i w_{ij}^{\pm}$ are the signed output and input strengths.
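A compact way to see how the signed, directed modularity fits together is a direct evaluation for a given partition. The sketch below is illustrative only and makes several assumptions: the function name `signed_modularity` is hypothetical, $w_{ij}$ is read as the weight of the link from $i$ to $j$, the null case uses output strengths for $i$ and input strengths for $j$, and a null term is simply skipped when the corresponding total strength is zero.

```python
def signed_modularity(w, communities):
    """Signed (and possibly directed) modularity of a partition.

    w[i][j]: weight of the link from i to j (symmetric if undirected).
    communities[i]: community label of node i.
    Assumes at least one non-zero weight.
    """
    n = len(w)
    pos = [[max(0.0, x) for x in row] for row in w]
    neg = [[max(0.0, -x) for x in row] for row in w]

    def out_in(m):
        out = [sum(row) for row in m]
        inn = [sum(m[i][j] for i in range(n)) for j in range(n)]
        return out, inn

    pos_out, pos_in = out_in(pos)
    neg_out, neg_in = out_in(neg)
    tot_pos, tot_neg = sum(pos_out), sum(neg_out)  # 2w+ and 2w-

    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] != communities[j]:
                continue
            null = 0.0
            if tot_pos > 0:
                null += pos_out[i] * pos_in[j] / tot_pos
            if tot_neg > 0:
                null -= neg_out[i] * neg_in[j] / tot_neg
            q += w[i][j] - null
    return q / (tot_pos + tot_neg)


# Correlation matrix of the +1/-1 example, with 5 and 3 nodes (assumed sizes).
states = [1] * 5 + [-1] * 3
R = [[si * sj for sj in states] for si in states]
truth = [0] * 5 + [1] * 3

if __name__ == "__main__":
    print(signed_modularity(R, truth))      # positive for the A/B split
    print(signed_modularity(R, [0] * 8))    # zero for a single community
```

For this toy correlation matrix the A/B split obtains a clearly positive value (15/34 by hand calculation), in contrast with the identically zero result of the unsigned formulation.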
III. COMPARISON WITH OTHER METHODS

In Fig. 2 we show a simple example of a network for which the original Newman modularity, Eq. (1), and the Potts model in [14] do not yield the expected partition into two communities, whereas our new modularity, Eq. (18), succeeds. It consists of two cliques, formed by positive links, and connected by two edges, one positive and the other negative. All positive links have a weight +1, and the negative one a weight $v < 0$. Any clique size greater than or equal to three does the job.

First, the Potts model in [14] is based on a Hamiltonian which only takes into account the difference between positive and negative weights within the modules, and is equivalent to modularity but without the null case term. In the network of Fig. 2, if $|v| < 1$, the strength between the two cliques is $1 + v > 0$, thus the Potts model is rewarded for joining both cliques in the same module. Clearly, the absence of the null case is responsible for this incorrect result.

On the other hand, the original definition of modularity, Eq. (1), which does include a null case, was not designed to cope with negative weights. In this example, its optimal partition is again a single module containing all the nodes if the value of $|v|$ is greater than the number of positive links.

Other authors have proposed an alternative definition of modularity for positive and negative links. Their work, also based on a Potts model representation of the assignment of nodes to communities, is totally compatible with the definition found in the current work, and equivalent to it for the values of their parameters $\lambda = \gamma = 1$.

FIG. 2: Network with two well-defined communities. Solid lines correspond to positive links, and the dashed line to the only negative link, with weight $v < 0$.
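The three behaviours discussed in this section can be reproduced on a small instance of the two-clique network. The following sketch rests on assumptions that are not in the original text: cliques of four nodes, the positive and negative bridges attached to different node pairs, a "no null case" score defined simply as the sum of weights inside modules (standing in for the Potts model without null term), and only two candidate partitions being compared (one module versus one module per clique).

```python
def two_cliques(n, v):
    """Two n-cliques of weight +1 joined by one +1 edge and one edge of weight v."""
    size = 2 * n
    w = [[0.0] * size for _ in range(size)]
    for block in (range(n), range(n, size)):
        for i in block:
            for j in block:
                if i != j:
                    w[i][j] = 1.0
    w[0][n] = w[n][0] = 1.0          # positive bridge
    w[1][n + 1] = w[n + 1][1] = v    # negative bridge
    return w


def within_sum(m, comm):
    n = len(m)
    return sum(m[i][j] for i in range(n) for j in range(n) if comm[i] == comm[j])


def null_sum(m, comm):
    """Sum over same-community pairs of s_i s_j / (2w); 0 if m has no weight."""
    s = [sum(row) for row in m]
    total = sum(s)
    if total == 0:
        return 0.0
    n = len(m)
    return sum(s[i] * s[j] for i in range(n) for j in range(n)
               if comm[i] == comm[j]) / total


def no_null_score(w, comm):
    """Sum of internal weights only (no null case term)."""
    return within_sum(w, comm)


def newman_q(w, comm):
    """Original modularity applied blindly to signed weights."""
    total = sum(map(sum, w))
    return (within_sum(w, comm) - null_sum(w, comm)) / total


def signed_q(w, comm):
    """Signed modularity for undirected networks."""
    pos = [[max(0.0, x) for x in row] for row in w]
    neg = [[max(0.0, -x) for x in row] for row in w]
    norm = sum(map(sum, pos)) + sum(map(sum, neg))
    return (within_sum(w, comm) - (null_sum(pos, comm) - null_sum(neg, comm))) / norm


split = [0] * 4 + [1] * 4
single = [0] * 8

if __name__ == "__main__":
    for v in (-0.5, -14.0):
        w = two_cliques(4, v)
        print(v, no_null_score(w, split), no_null_score(w, single),
              newman_q(w, split), newman_q(w, single),
              signed_q(w, split), signed_q(w, single))
```

With $v = -0.5$ the score without null case prefers the single module, and with $v = -14$ (larger in magnitude than the 13 positive links) the original modularity assigns a negative value to the two-clique split, whereas the signed modularity favours the split in both cases.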



IV. APPLICATION TO A REAL NETWORK 

We now turn to an example of community structure detection using our method in a specific social network. We deal with the spatial distribution of retail activities in the city of Lyon, thanks to data obtained from the Lyon Chamber of Commerce [33]. We have shown in [3] how to transform data on locations into a matrix of correlated data, in this case of attractions/repulsions (i.e. positive and negative links) between retail activities. To compute the interaction between activities A and B, the idea is to compare the concentration of B stores in the neighborhood of A stores to a reference concentration obtained by locating the B stores randomly. To compute the random reference, the idea [13] is to locate the B stores on the array of all existing store sites. This is the best way to take into account automatically the geographical peculiarities of each town. The logarithm of the ratio of the actual concentration to the reference concentration gives the interaction coefficient, which is positive for attractions and negative for repulsions, as anticipated.

More precisely, the (self) interaction of $N_A$ A stores embedded in a larger set of $N_t$ locations is

$$ a_{AA}(r) = \log_{10} \left[ \frac{N_t - 1}{N_A (N_A - 1)} \sum_{i=1}^{N_A} \frac{N_A(A_i)}{N_t(A_i)} \right], \qquad (21) $$

where $N_A(A_i)$ and $N_t(A_i)$ represent the number of A stores and the total number of stores in the neighborhood of store $A_i$, i.e. locations at a distance smaller than $r$. Similarly, the coefficient characterizing the spatial distribution of the $B_j$ around the $A_i$ is

$$ a_{AB}(r) = \log_{10} \left[ \frac{N_t - N_A}{N_A N_B} \sum_{i=1}^{N_A} \frac{N_B(A_i)}{N_t(A_i) - N_A(A_i)} \right], \qquad (22) $$

where $N_A(A_i)$, $N_B(A_i)$ and $N_t(A_i)$ are respectively the A, B and total number of locations in the neighborhood of point $A_i$ (not counting $A_i$). Both $a_{AA}$ and $a_{AB}$ are defined so that they take the value 0 when there are no spatial correlations. In the case of the $a_{AB}$ coefficient, this means that the local B spatial concentration is not perturbed, on average, by the presence of A stores, and is equal to the average concentration over the whole town, $N_B / (N_t - N_A)$. Only coefficients which deviate significantly from 0, using a Monte Carlo sampling, are taken into account in the adjacency matrix. The final result of the analysis of the 11629 stores in Lyon is a directed network with 97 nodes (retail activities) and 1131 links, 715 positive and 416 negative.

TABLE I: Comparison between the different partitions and the Lyon Chamber of Commerce classification.

                 optimal partition   optimal partition
                 of Eq. (1)          of Eq. (18)
Rand Index       0.6168              0.6952
Jaccard Index    0.1336              0.1426
NMI              0.1458              0.2310
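The cross-interaction coefficient between two activities can be sketched from store coordinates as follows. This is a hedged illustration, not the procedure used for the Lyon data: the function name `interaction_ab`, the planar Euclidean distance, the treatment of A stores with no non-A neighbor (they contribute zero) and the return value of minus infinity when no B store is ever found near an A store are all assumptions, and the Monte Carlo significance test is not included.

```python
import math


def interaction_ab(points, labels, a, b, r):
    """Sketch of a_AB(r): log10 of the average local concentration of b stores
    around a stores, relative to the town-wide reference (requires a != b)."""
    n_t = len(points)
    n_a = labels.count(a)
    n_b = labels.count(b)
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        if labels[i] != a:
            continue
        near_b = near_other = 0
        for j, (xj, yj) in enumerate(points):
            if j == i or math.hypot(xi - xj, yi - yj) >= r:
                continue
            if labels[j] == b:
                near_b += 1
            if labels[j] != a:
                near_other += 1      # counts N_t(A_i) - N_A(A_i)
        if near_other > 0:           # assumption: empty neighborhoods add 0
            total += near_b / near_other
    ratio = (n_t - n_a) / (n_a * n_b) * total
    return math.log10(ratio) if ratio > 0 else float("-inf")


# Hypothetical layout: each A store has a B store next door, C stores elsewhere.
points = [(0, 0), (0.1, 0), (10, 0), (10.1, 0), (5, 0), (5.1, 0)]
labels = ["A", "B", "A", "B", "C", "C"]

if __name__ == "__main__":
    print(interaction_ab(points, labels, "A", "B", r=1.0))    # attraction, > 0
    print(interaction_ab(points, labels, "A", "B", r=100.0))  # no structure, ~ 0
```

When the radius is so large that every store is a neighbor of every other one, the local and reference concentrations coincide and the coefficient vanishes, as expected for the absence of spatial correlations.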

We analyze the community structure of the resulting network using the modularity defined in Eq. (18). The optimization method used is Tabu search [9], which for this case gave the highest modularity when compared to others [28]. We compare the partitions obtained by independently optimizing Eq. (1) (resulting in 4 communities) and Eq. (18) (resulting in 6 communities) against the Lyon Chamber of Commerce classification of retail activities (9 predefined communities). The similarity of the first two partitions to the third one is measured using three different indices, namely the Rand Index [29], the Jaccard Index [30], and the Normalized Mutual Information (NMI) [31] (see Table I). The larger their values, the more similar the partitions are. All indices show a better performance of Eq. (18) in recovering the actual communities provided by the Lyon Chamber of Commerce. Note that in both modularities we have used all the positive and negative links. Therefore, the increase in performance can only be attributed to a proper use of the information embedded in the links.
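Two of the similarity indices used above are simple pair-counting measures and can be sketched in a few lines. The function names and the toy partitions below are illustrative assumptions; the definitions are the usual ones, in which every pair of nodes is classified according to whether both partitions place it in the same community.

```python
from itertools import combinations


def pair_counts(p1, p2):
    """Count node pairs by co-membership in partitions p1 and p2.

    Returns (a11, a10, a01, a00): together in both, only in p1,
    only in p2, and in neither.
    """
    a11 = a10 = a01 = a00 = 0
    for i, j in combinations(range(len(p1)), 2):
        same1 = p1[i] == p1[j]
        same2 = p2[i] == p2[j]
        if same1 and same2:
            a11 += 1
        elif same1:
            a10 += 1
        elif same2:
            a01 += 1
        else:
            a00 += 1
    return a11, a10, a01, a00


def rand_index(p1, p2):
    a11, a10, a01, a00 = pair_counts(p1, p2)
    return (a11 + a00) / (a11 + a10 + a01 + a00)


def jaccard_index(p1, p2):
    # Undefined (division by zero) if neither partition groups any pair.
    a11, a10, a01, _ = pair_counts(p1, p2)
    return a11 / (a11 + a10 + a01)


# Hypothetical partitions of six nodes.
found = [0, 0, 0, 1, 1, 1]
reference = [0, 0, 1, 1, 2, 2]

if __name__ == "__main__":
    print(rand_index(found, reference), jaccard_index(found, reference))
```

The Jaccard index ignores pairs separated in both partitions, which is why it tends to be much smaller than the Rand index when the reference has many communities.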

Our method is also helpful to understand the spatial organisation of retail stores. To interpret the information conveyed by the network links, we make use of the z-score. The basic idea consists in computing the z-score ($Z$) of the internal strength of each node with respect to the average internal strength of the community to which it is assigned. To be consistent with our approach throughout the paper, both quantities should be evaluated in accordance with the sign of the interactions and with the directionality of links; then

$$ Z_i^{\pm,\mathrm{in/out}} = \frac{ w_{i,\mathrm{int}}^{\pm,\mathrm{in/out}} - \langle w_{\mathrm{int}}^{\pm,\mathrm{in/out}} \rangle }{ \sigma\left( w_{\mathrm{int}}^{\pm,\mathrm{in/out}} \right) }, \qquad (23) $$

where the subindex 'int' expresses that links are restricted to those within the community to which node $i$ belongs, 'in/out' refers to the direction of links, and $\langle \cdot \rangle$ and $\sigma$ are the average and standard deviation of the corresponding variables, respectively.

TABLE II: Roles of retailers within communities.

Most attractive    Most repulsive    Most attracted     Most repelled
Funeral Services   Dairy products    Gas Station        Gas Station
Sports facility    Cake shop         Sports facility    Flea market
Car dealer         Drugstore         Funeral Services   Car dealer
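The signed, directed z-score can be sketched as below. Several choices here are assumptions rather than details given in the text: the function name `internal_z_scores`, the reading of `w[i][j]` as the weight of the link from `i` to `j`, the use of the population standard deviation, and the convention that a community with zero spread yields a z-score of 0.

```python
from statistics import mean, pstdev


def internal_z_scores(w, communities, sign=1, direction="out"):
    """Z-scores of the signed, directed internal strength of every node.

    sign=+1 uses positive weights, sign=-1 the magnitudes of negative ones;
    direction is "out" (links leaving the node) or "in" (links entering it).
    """
    n = len(w)

    def part(x):  # entry of w^+ (sign=+1) or w^- (sign=-1)
        return max(0.0, sign * x)

    strength = []
    for i in range(n):
        same = [j for j in range(n) if communities[j] == communities[i]]
        if direction == "out":
            strength.append(sum(part(w[i][j]) for j in same))
        else:
            strength.append(sum(part(w[j][i]) for j in same))

    z = []
    for i in range(n):
        group = [strength[k] for k in range(n)
                 if communities[k] == communities[i]]
        mu, sigma = mean(group), pstdev(group)
        z.append((strength[i] - mu) / sigma if sigma > 0 else 0.0)
    return z


# Hypothetical directed signed network: nodes 0-2 form one community.
w = [[0.0, 2.0, 1.0, 5.0],
     [0.0, 0.0, 1.0, 0.0],
     [-1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
communities = [0, 0, 0, 1]

if __name__ == "__main__":
    print(internal_z_scores(w, communities, sign=1, direction="out"))
    print(internal_z_scores(w, communities, sign=-1, direction="out"))
```

In this toy case node 0 has the largest positive output z-score, while node 2 has the largest negative output z-score; the strong link from node 0 to node 3 is ignored because it leaves the community.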

Using the z-score we can answer some questions about the role of nodes in their communities. For example, one can study, for each community, which are the most attractive retailers (max $Z^{+,\mathrm{out}}$), the most repulsive retailers (max $Z^{-,\mathrm{out}}$), the most attracted retailers (max $Z^{+,\mathrm{in}}$), and the most repelled retailers (max $Z^{-,\mathrm{in}}$). In Table II we show the three highest results of these z-scores obtained for the largest community found (34 retail activities). This group gathers the proximity stores, which are mainly food stores. Here are some examples of the understanding of the spatial organisation of retail stores allowed by our method. Sports facilities and funeral services are peculiar because they strongly attract (and are attracted by) some specific activities that go along with them almost systematically, e.g. car repairs and small hardware stores. Gas stations enjoy a paradoxical situation in this group, since they represent both the most attracted and the most repelled activity. There is an interesting commercial interpretation of this paradox: gas stations tend to have the most specific commercial environment, strongly attracting some of the group's activities (such as supermarkets) and being strongly repelled by others which are nevertheless in the proximity store group (for example, butchers or cake shops almost never have gas stations close to them). Dairy products and cake shops strongly repel some specific activities that belong to their same group, such as car repairs or company restaurants.



V. CONCLUSIONS 

Summarizing, we have proposed a new formulation of modularity that allows for the analysis of any complex network, in general with links that are directed, weighted and signed, and with self-loops, preserving the original probabilistic semantics of modularity. With this definition one can analyze networks arising from correlated data without necessarily symmetrizing the network, discarding auto-correlations or considering only the unsigned value of the correlations. We envisage that other methods are also likely to be appropriate for this task, after pertinent adaptation, for example the analysis via clique percolation [32], or specifically methods based on the minimization of the energy function of an equivalent spin glass system, where weighted signed links can be interpreted in terms of ferromagnetic and anti-ferromagnetic interactions between spins [26].
We have analyzed within the scope of the new modularity an interesting model of attraction-repulsion of retail stores in a large city, previously reported in [14]. The results improve on those obtained using the original definition of modularity when compared to the Lyon Chamber of Commerce classification, and also point out the necessity of defining new roles of nodes based on the directionality and sign of the weights of links, as we have proposed for the z-score.